OCPBUGS-18893: Rechecking pending Pods (conflict resolved) #196

Conversation

nicklesimba
Contributor

This fix resolves the issue where, after a forceful node reboot, force deleting a pod in a stateful set causes the pod to be recreated and remain indefinitely in the Pending state.

Solution description, as written on xagent003's upstream PR:

"We shouldn't treat all pending Pods as "alive" and skip the check. The list of all Pods fetched earlier may be stale, and as observed in some scenarios, several seconds before the ip-reconciler does the isPodAlive check.
Instead, can we retry a Get on an individual Pod, in the hope that it has its final IP/network annotations? So we try to refetch the pod a few times if it is in Pending state and the initial IP check fails. After that, just do the IP matching check like before"

Note that xagent003's upstream PR is stale and has since been rebased by dougbtv. You can find the current upstream PR here.
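
The retry described in the quoted solution can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: the `Pod` struct, the `PodGetter` function type, the `hasIP` helper, and the `retries` and `backoff` parameters are all hypothetical stand-ins for the real Kubernetes types and client calls. Only the name `isPodAlive` comes from the description above, and its signature here is assumed.

```go
package main

import (
	"fmt"
	"time"
)

// Pod is a simplified, hypothetical stand-in for a Kubernetes Pod.
type Pod struct {
	Name  string
	Phase string   // for example "Pending" or "Running"
	IPs   []string // IPs taken from the pod's network annotations
}

// PodGetter is a hypothetical function that refetches a single pod by name,
// standing in for an individual Get against the API server.
type PodGetter func(name string) (*Pod, error)

// hasIP reports whether the pod currently lists the given IP.
func hasIP(p *Pod, ip string) bool {
	for _, podIP := range p.IPs {
		if podIP == ip {
			return true
		}
	}
	return false
}

// isPodAlive sketches the idea from the description: a Pending pod whose
// IP check fails is refetched a few times before the final IP match,
// because the pod taken from the earlier list may be stale.
func isPodAlive(pod *Pod, ip string, get PodGetter, retries int, backoff time.Duration) bool {
	if hasIP(pod, ip) {
		return true
	}
	for i := 0; i < retries && pod.Phase == "Pending"; i++ {
		time.Sleep(backoff)
		fresh, err := get(pod.Name)
		if err != nil {
			continue // assumption: treat a failed Get as "try again"
		}
		pod = fresh
		if hasIP(pod, ip) {
			return true
		}
	}
	// After the retries, fall back to the plain IP matching check.
	return hasIP(pod, ip)
}

func main() {
	calls := 0
	// Simulated API: the pod gains its IP on the second refetch.
	get := func(name string) (*Pod, error) {
		calls++
		if calls < 2 {
			return &Pod{Name: name, Phase: "Pending"}, nil
		}
		return &Pod{Name: name, Phase: "Running", IPs: []string{"10.0.0.5"}}, nil
	}
	stale := &Pod{Name: "web-0", Phase: "Pending"}
	fmt.Println(isPodAlive(stale, "10.0.0.5", get, 3, 10*time.Millisecond))
}
```

Each retry in this sketch costs one extra Get per Pending pod, which is the kind of additional API load the review comment further down is concerned about.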

openshift-ci-robot added the jira/valid-reference and jira/valid-bug labels Sep 12, 2023
@openshift-ci-robot
Contributor

@nicklesimba: This pull request references Jira Issue OCPBUGS-18893, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.15.0) matches configured target version for branch (4.15.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.


@maiqueb
Contributor

maiqueb commented Sep 22, 2023

As discussed offline, this band-aid will put stress on the API and thus impact workloads on other networks, including the cluster default network; the code should be refactored to rely on informers ASAP.

Having said that, this issue with pending pods is real, and is addressed by this PR.

@dougbtv we want to fix the pending pod issue regardless, right?

/lgtm

openshift-ci bot added the lgtm label Sep 22, 2023
@openshift-ci
Contributor

openshift-ci bot commented Sep 22, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: maiqueb, nicklesimba


openshift-ci bot added the approved label Sep 22, 2023
@openshift-ci
Contributor

openshift-ci bot commented Sep 22, 2023

@nicklesimba: all tests passed!


@openshift-merge-robot openshift-merge-robot merged commit 7478beb into openshift:master Sep 22, 2023
@openshift-ci-robot
Contributor

@nicklesimba: Jira Issue OCPBUGS-18893: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-18893 has been moved to the MODIFIED state.


@openshift-merge-robot
Contributor

Fix included in accepted release 4.15.0-0.nightly-2023-09-27-073353

@openshift-merge-robot
Contributor

Fix included in accepted release 4.15.0-0.nightly-2024-01-13-050900

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

5 participants